Generalized unknown morpheme guessing for hybrid POS tagging of Korean

Authors

Jeongwon Cha

Gary Geunbae Lee

Jong-Hyeok Lee

Abstract

Most errors in Korean morphological analysis and part-of-speech (POS) tagging are caused by unknown morphemes. This paper presents a generalized unknown-morpheme handling method with POSTAG (POStech TAGger), a statistical/rule-based hybrid POS tagging system. The generalized unknown-morpheme guessing combines a morpheme pattern dictionary, which encodes general lexical patterns of Korean morphemes, with a posteriori syllable tri-gram estimation. The syllable tri-grams help to calculate lexical probabilities of the unknown morphemes and are used to search for the best tagging result. In our scheme, we can guess the POS tags of unknown morphemes regardless of their number and position in an eojeol, which was not possible before in Korean tagging systems. In a series of experiments using three different domain corpora, we achieve 97% tagging accuracy despite many unknown morphemes in the test corpora.

1 Introduction

Part-of-speech (POS) tagging has many difficult problems to attack, such as insufficient training data, inherent POS ambiguities, and, most seriously, unknown words. Unknown words are ubiquitous in any application and cause major tagging failures in many cases. Since Korean is an agglutinative language, we have unknown-morpheme problems instead of unknown-word problems in our POS tagging. The usual way of handling unknown morphemes was to guess possible POS tags for an unknown morpheme by checking connectable functional morphemes in the same eojeol[1] (Kang, 1993). In this way, possible POS tags could be guessed for a single unknown morpheme only when it is positioned at the beginning of an eojeol. If an eojeol contains more than one unknown morpheme, or if unknown morphemes appear anywhere other than the first position, none of the previous methods can estimate them efficiently.

So, we propose a morpheme-pattern dictionary which enables us to treat unknown morphemes in the same way as registered known morphemes, and thereby to guess them regardless of their number and position in an eojeol. The unknown-morpheme handling using the morpheme-pattern dictionary is integrated into a hybrid POS disambiguation.

POS disambiguation has usually been performed by statistical approaches, mainly using hidden Markov models (HMMs) (Cutting et al., 1992; Kupiec, 1992; Weischedel et al., 1993). However, since statistical approaches take into account neighboring tags only within a limited window (usually two or three), the decision sometimes cannot cover all the linguistic context necessary for POS disambiguation. The approaches are also inappropriate for idiomatic expressions, for which lexical terms need to be directly referenced. Statistical approaches alone are not enough, especially for agglutinative languages (such as Korean), which usually have complex morphological structures. In agglutinative languages, a word (called an eojeol in Korean) usually consists of a separable single stem morpheme plus one or more functional morphemes, and a POS tag should be assigned to each morpheme to cope with the complex morphological phenomena. Recently, rule-based approaches are ...

* This project was supported by KOSEF (teukjeongkicho #970-1020-301-3, 1997).

[1] An eojeol is a Korean spacing unit (similar to an English word) which usually consists of one or more stem morphemes and functional morphemes.
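The abstract says that syllable tri-grams are used to estimate lexical probabilities of unknown morphemes, but this excerpt does not give the formula. The following is only a minimal sketch of the general idea, under assumptions that are not from the paper: the lexical probability of a morpheme given a tag is approximated as a product of tag-conditioned syllable tri-gram probabilities with add-one smoothing, and the tag names, toy training data, and class and function names are all hypothetical.

```python
import math
from collections import defaultdict

# Padding symbols (hypothetical names) marking morpheme start and end.
BOS, EOS = "<s>", "</s>"

def syllable_trigrams(morpheme):
    """Return padded syllable tri-grams. Each character of a precomposed
    Hangul string is treated as one syllable."""
    padded = [BOS, BOS] + list(morpheme) + [EOS]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

class SyllableTrigramModel:
    """Assumed model: P(morpheme | tag) ~ product of P(s3 | s1, s2, tag)."""

    def __init__(self):
        self.tri = defaultdict(int)  # (tag, s1, s2, s3) -> count
        self.bi = defaultdict(int)   # (tag, s1, s2) -> count
        self.vocab = {EOS}           # symbols that can be predicted

    def train(self, tagged_morphemes):
        for morpheme, tag in tagged_morphemes:
            self.vocab.update(morpheme)
            for s1, s2, s3 in syllable_trigrams(morpheme):
                self.tri[(tag, s1, s2, s3)] += 1
                self.bi[(tag, s1, s2)] += 1

    def log_prob(self, morpheme, tag):
        """Add-one smoothed log-probability of the morpheme under a tag."""
        v = len(self.vocab)
        total = 0.0
        for s1, s2, s3 in syllable_trigrams(morpheme):
            num = self.tri.get((tag, s1, s2, s3), 0) + 1
            den = self.bi.get((tag, s1, s2), 0) + v
            total += math.log(num / den)
        return total

    def guess(self, morpheme, tags):
        """Pick the tag under which the syllable sequence is most likely."""
        return max(tags, key=lambda t: self.log_prob(morpheme, t))

# Toy, made-up training data; a real system would use a tagged corpus.
TOY_DATA = [("학교", "NOUN"), ("학생", "NOUN"), ("가", "VERB"), ("먹", "VERB")]
model = SyllableTrigramModel()
model.train(TOY_DATA)
print(model.guess("학원", ["NOUN", "VERB"]))
```

In a full tagger, a score of this kind would stand in for the dictionary-based lexical probability of an unknown morpheme and be combined with tag-transition probabilities during the search for the best tag sequence; that integration is not shown here.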


Similar resources

Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of Korean

Most errors in Korean morphological analysis and part-of-speech (POS) tagging are caused by unknown morphemes. This paper presents a syllable-pattern-based generalized unknown-morpheme-estimation method with POSTAG (POStech TAGger), which is a statistical and rule-based hybrid POS tagging system. This method of guessing unknown morphemes is based on a combination of a morpheme pattern dictionary...


Hybrid POS tagging with generalized unknown-word handling

This paper presents POSTAG as a statistical/rule-based hybrid part-of-speech (POS) tagging system with generalized unknown-word handling. The POSTAG integrates morphological analysis with statistical POS disambiguation and post rule-based error-correction. The error-correction rules are automatically learned from a tagged corpus and selectively correct standard HMM tagging errors. The morpho...


Multilingual Word Segmentation and Part-of-Speech Tagging: A Machine Learning Approach Incorporating Diverse Features

The aim of this dissertation is to study statistical methods for multilingual word segmentation and POS tagging with high accuracy. Word segmentation and part-of-speech (POS) tagging are fundamental language analysis tasks in natural language processing, and used in many applications. Existence of unknown words is a large problem in these tasks and they need to be properly handled. We attempt t...


Chinese POS Disambiguation and Unknown Word Guessing with Lexicalized HMMs

This article presents a lexicalized HMM-based approach to Chinese part-of-speech (POS) disambiguation and unknown word guessing (UWG). In order to explore word-internal morphological features for Chinese POS tagging, four types of pattern tags are defined to indicate the way lexicon words are used in a segmented sentence. Such patterns are combined further with POS tags. Thus, Chinese POS disam...


Unsupervised Morphology Induction for Part-of-Speech Tagging

In this paper we present an unsupervised morphology induction algorithm that uses Alignment Based Learning (ABL; e.g., Zaanen, 2001) for hypothesis generation. We show how this algorithm can be used to induce a lexicon and morphological rules for a wide range of natural languages. The resulting morphological rules and structures are optimized during the induction process using a constraint sat...




Publication date: 1998
